Abstract

This paper deals with the problem of estimating the level sets $L(c) = \{F(x) > c\}$, with $c \in (0,1)$,
of an unknown distribution function $F$ on $\mathbb{R}^d_+$. A plug-in approach is followed: given a
consistent estimator $F_n$ of $F$, we estimate $L(c)$ by $L_n(c) = \{F_n(x) > c\}$. We state consistency results
with respect to the Hausdorff distance and the volume of the symmetric difference. These results
generalize those previously obtained, in a bivariate framework, in Di Bernardino et al. (2011).
Finally, we investigate the effects of scaling the data on our consistency results.

Keywords: Level sets, multidimensional distribution function, plug-in estimation, Hausdorff 
distance. 



Introduction 

In the present paper, we consider the problem of estimating the level sets of a $d$-variate distribution
function. To this end, we generalize the results obtained in a previous paper (Di Bernardino et al., 2011).

As already remarked in Di Bernardino et al. (2011), when considering the level sets of a distribution
function, the commonly assumed property of compactness for these sets is no longer reasonable. Hence,
unlike the classical literature (Cavalier, 1997; Cuevas and Fraiman, 1997; Baíllo et al., 2001; Baíllo,
2003; Cuevas et al., 2006; Biau et al., 2007; Laloe, 2009), we need to work in a non-compact setting,
and this requires special attention in the statement of our problem.

We follow the same general approach as in Di Bernardino et al. (2011), and we keep the same notation
as far as possible. Given a consistent estimator $F_n$ of the distribution function $F$,
we propose a plug-in approach (see, e.g., Baíllo et al., 2011; Rigollet and Vert, 2009; Cuevas et al.,
2006) to estimate the level set

$$L(c) = \{x \in \mathbb{R}^d_+ : F(x) > c\}, \quad \text{for } c \in (0,1),$$

by

$$L_n(c) = \{x \in \mathbb{R}^d_+ : F_n(x) > c\}, \quad \text{for } c \in (0,1).$$

The regularity properties of F and F n as well as the consistency properties of F n will be specified in 
the statements of our theorems. 
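
To fix ideas, the plug-in construction can be sketched numerically. The following is only an illustrative sketch, not part of the results of this paper: it assumes that $F_n$ is the empirical distribution function and it discretises $[0,T]^d$ on a regular grid. The names `empirical_cdf` and `plug_in_level_set`, the toy sample and the grid size are hypothetical choices made for the illustration.

```python
import itertools
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def empirical_cdf(sample: Sequence[Point]) -> Callable[[Point], float]:
    """Return F_n, the d-variate empirical distribution function of the sample."""
    n = len(sample)

    def F_n(x: Point) -> float:
        # Fraction of observations that are componentwise <= x.
        return sum(all(s <= t for s, t in zip(obs, x)) for obs in sample) / n

    return F_n


def plug_in_level_set(F_n: Callable[[Point], float], c: float,
                      T: float, steps: int, d: int) -> List[Point]:
    """Grid points x of [0, T]^d with F_n(x) > c (a discretised L_n(c)^T)."""
    axis = [T * k / steps for k in range(steps + 1)]
    return [x for x in itertools.product(axis, repeat=d) if F_n(x) > c]


# Toy bivariate sample (assumption: illustrative data only).
sample = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (3.0, 3.0)]
F_n = empirical_cdf(sample)
level_set = plug_in_level_set(F_n, c=0.5, T=3.0, steps=3, d=2)
```

The grid representation is chosen for simplicity; its cost grows like `(steps + 1) ** d`, which already hints at the role played by the dimension $d$ in what follows.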
As in Di Bernardino et al. (2011), our consistency results are stated with respect to two criteria of
"physical proximity" between sets: the Hausdorff distance and the volume of the symmetric difference.
While consistency in terms of the Hausdorff distance is a straightforward extension of Theorem 2.1 in
Di Bernardino et al. (2011) (see Theorem 2.1 below), things are a little more complex for the volume of
the symmetric difference. In particular, in this latter case the convergence rate suffers from the
well-known curse of dimensionality (see Theorem 3.1).

A second aim of this paper is to analyze the effects of scaling the data on our consistency results (see
Theorem 4.1).

The paper is organized as follows. We introduce some notation, tools and technical assumptions in
Section 1. Consistency and asymptotic properties of our estimator of $L(c)$ are given in Sections 2 and
3. Section 4 is devoted to the effects of scaling the data on our consistency results. Finally,
proofs are postponed to Section 5.



1. Notation and preliminaries 

In this section we introduce some notation and tools which will be useful later. 

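Let $\mathbb{N}^* = \mathbb{N} \setminus \{0\}$, $\mathbb{R}^*_+ = \mathbb{R}_+ \setminus \{0\}$ and $\mathbb{R}^{d*}_+ = \mathbb{R}^d_+ \setminus \{0\}$. Let $\mathcal{F}$ be the set of continuous distribution
functions $\mathbb{R}^d_+ \to [0,1]$ and $X := (X_1, X_2, \ldots, X_d)$ a random vector with distribution function $F \in \mathcal{F}$.
Given an i.i.d. sample $\{X_i\}_{i=1}^n$ in $\mathbb{R}^d_+$ with distribution function $F \in \mathcal{F}$, we denote by $F_n$ an estimator
of $F$ based on this finite sample.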

Define, for $c \in (0,1)$, the upper $c$-level set of $F \in \mathcal{F}$ and its plug-in estimator

$$L(c) = \{x \in \mathbb{R}^d_+ : F(x) > c\}, \qquad L_n(c) = \{x \in \mathbb{R}^d_+ : F_n(x) > c\},$$

and

$$\{F = c\} = \{x \in \mathbb{R}^d_+ : F(x) = c\}.$$

In addition, given $T > 0$, we set

$$L(c)^T = \{x \in [0,T]^d : F(x) > c\}, \qquad L_n(c)^T = \{x \in [0,T]^d : F_n(x) > c\},$$

$$\{F = c\}^T = \{x \in [0,T]^d : F(x) = c\}.$$

Given a set $A \subset \mathbb{R}^d_+$, we denote by $\partial A$ its boundary, and by $\beta A$ the scaled set $\{\beta x, \text{ with } x \in A\}$.

Note that, in the presence of a plateau at level $c$, $\{F = c\}$ can be a portion of the quadrant $\mathbb{R}^d_+$ instead
of a set of Lebesgue measure zero in $\mathbb{R}^d_+$. In the following we introduce suitable conditions in order to
avoid this situation.

We denote by $B(x,\rho)$ the closed ball centered at $x \in \mathbb{R}^d_+$ with positive radius $\rho$. Let
$B(S,\rho) = \bigcup_{x \in S} B(x,\rho)$, with $S$ a closed set of $\mathbb{R}^d_+$.
For $r > 0$ and $\zeta > 0$, define

$$E = B(\{x \in \mathbb{R}^d_+ : |F(x) - c| \le r\}, \zeta),$$

and, for a twice differentiable function $F$,

$$m^\nabla = \inf_{x \in E} \|(\nabla F)_x\|, \qquad M_H = \sup_{x \in E} \|(HF)_x\|,$$

where $(\nabla F)_x$ is the gradient vector of $F$ evaluated at $x$ and $\|(\nabla F)_x\|$ its Euclidean norm, $(HF)_x$ the
Hessian matrix evaluated at $x$ and $\|(HF)_x\|$ its matrix norm induced by the Euclidean norm.

For the sake of completeness, we recall that if $A_1$ and $A_2$ are compact sets in $\mathbb{R}^d_+$, the Hausdorff distance
between $A_1$ and $A_2$ is defined by

$$d_H(A_1, A_2) = \max\left\{ \sup_{x \in A_1} d(x, A_2),\ \sup_{x \in A_2} d(x, A_1) \right\},$$

where $d(x, A_2) = \inf_{y \in A_2} \|x - y\|$.
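
For finite sets of points the suprema and infima above become maxima and minima, so the definition can be evaluated directly. The sketch below is only an illustration under that assumption (finite point clouds, for instance discretised boundaries); the function names `dist_to_set` and `hausdorff` and the two toy sets are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def dist_to_set(x: Point, A: Sequence[Point]) -> float:
    """d(x, A): smallest Euclidean distance from x to a point of the finite set A."""
    return min(math.dist(x, y) for y in A)


def hausdorff(A1: Sequence[Point], A2: Sequence[Point]) -> float:
    """d_H(A1, A2) for two non-empty finite sets of points."""
    return max(
        max(dist_to_set(x, A2) for x in A1),  # sup over A1 of d(x, A2)
        max(dist_to_set(x, A1) for x in A2),  # sup over A2 of d(x, A1)
    )


# Toy sets (assumption: illustrative data only). A1 is contained in A2,
# so only the point (1, 3) of A2 contributes to the distance.
A1 = [(0.0, 0.0), (1.0, 0.0)]
A2 = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
d_h = hausdorff(A1, A2)
```

Note that both one-sided terms are needed: here the first term is zero because $A_1 \subset A_2$, while the second term equals the distance from $(1,3)$ to $A_1$.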

The above expression is well defined even when $A_1$ and $A_2$ are merely closed (not necessarily compact)
sets but, in this case, the value $d_H(A_1, A_2)$ could be infinite. In our setting, in order to avoid
such situations, we introduce the following assumption.

H: There exist $\gamma > 0$ and $A > 0$ such that, if $|t - c| \le \gamma$ then $\forall\, T > 0$ such that $\{F = c\}^T \neq \emptyset$
and $\{F = t\}^T \neq \emptyset$,

$$d_H(\{F = c\}^T, \{F = t\}^T) \le A\,|t - c|.$$

For further details about this assumption the interested reader is referred to Di Bernardino et al. (2011),
Cuevas et al. (2006), Tsybakov (1997). Note that a sufficient condition for Assumption H can be
obtained in terms of the differentiability properties of $F$. Proposition 1.1 below is a straightforward
extension to the $d$-variate setting of Proposition 1.1 in Di Bernardino et al. (2011).

Proposition 1.1 Let $c \in (0,1)$. Let $F \in \mathcal{F}$ be twice differentiable on $\mathbb{R}^{d*}_+$. Assume there exist $r > 0$,
$\zeta > 0$ such that $m^\nabla > 0$ and $M_H < \infty$. Then $F$ satisfies Assumption H, with $A = 2/m^\nabla$.

Remark 1 Under the assumptions of Proposition 1.1, $\{F = t\}$ is a set of Lebesgue measure zero in $\mathbb{R}^d_+$.
Furthermore we obtain $\partial L(c)^T = \{F = c\}^T = \{F = c\} \cap [0,T]^d$ (for details we refer to Remark 1 in
Di Bernardino et al., 2011 and Theorem 3.2 in Rodriguez-Casal, 2003).



2. Consistency in terms of the Hausdorff distance 

In this section we study the consistency properties of $L_n(c)^T$ with respect to the Hausdorff distance
between $\partial L_n(c)^T$ and $\partial L(c)^T$.

From now on we write, for $n \in \mathbb{N}^*$,

$$\|F - F_n\|_\infty = \sup_{x \in \mathbb{R}^d_+} |F(x) - F_n(x)|,$$

and for $T > 0$

$$\|F - F_n\|_\infty^T = \sup_{x \in [0,T]^d} |F(x) - F_n(x)|.$$

The following result is a straightforward adaptation of Theorem 2.1 in Di Bernardino et
al. (2011).

Theorem 2.1 Let $c \in (0,1)$. Let $F \in \mathcal{F}$ be twice differentiable on $\mathbb{R}^{d*}_+$. Assume that there exist $r > 0$,
$\zeta > 0$ such that $m^\nabla > 0$ and $M_H < \infty$. Let $T_1 > 0$ such that for all $t : |t - c| \le r$, $\partial L(t)^{T_1} \neq \emptyset$.
Let $(T_n)_{n \in \mathbb{N}^*}$ be an increasing sequence of positive values. Assume that, for each $n$ and for almost all
samples of size $n$, $F_n$ is a continuous function and that

$$\|F - F_n\|_\infty \to 0, \quad \text{a.s.}$$

Then, for $n$ large enough,

$$d_H(\partial L(c)^{T_n}, \partial L_n(c)^{T_n}) \le 6 A \|F - F_n\|_\infty^{T_n}, \quad \text{a.s.},$$

where $A = 2/m^\nabla$. Therefore we have

$$d_H(\partial L(c)^{T_n}, \partial L_n(c)^{T_n}) = O(\|F - F_n\|_\infty), \quad \text{a.s.}$$

Under the assumptions of Theorem 2.1, $d_H(\partial L(c)^{T_n}, \partial L_n(c)^{T_n})$ converges to zero and the quality of our
plug-in estimator is directly related to the quality of the estimator $F_n$. For comments and discussion
of this result we refer the interested reader to Remark 2 in Di Bernardino et al. (2011).



3. $L_1$ consistency

The previous section was devoted to the consistency of $L_n(c)$ in terms of the Hausdorff distance. We
now consider another consistency criterion: the consistency of the volume (in the Lebesgue measure
sense) of the symmetric difference between $L(c)^{T_n}$ and $L_n(c)^{T_n}$. This means that we define the distance
between two subsets $A_1$ and $A_2$ of $\mathbb{R}^d_+$ by

$$d_\lambda(A_1, A_2) = \lambda(A_1 \triangle A_2),$$

where $\lambda$ stands for the Lebesgue measure on $\mathbb{R}^d$ and $\triangle$ for the symmetric difference.
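
The distance $d_\lambda$ restricted to $[0,T]^d$ can be approximated by a midpoint Riemann sum over a regular grid. The sketch below is an illustration only; the sets are passed as indicator functions, and the name `sym_diff_volume`, the grid size and the two toy half-spaces are hypothetical choices (in practice one would pass `lambda x: F(x) > c` and `lambda x: F_n(x) > c`).

```python
import itertools
from typing import Callable, Tuple

Point = Tuple[float, ...]


def sym_diff_volume(in_A1: Callable[[Point], bool],
                    in_A2: Callable[[Point], bool],
                    T: float, steps: int, d: int) -> float:
    """Approximate lambda((A1 symmetric-difference A2) within [0, T]^d).

    The cube is split into steps**d cells of side h = T / steps; a cell is
    counted when its midpoint belongs to exactly one of the two sets.
    """
    h = T / steps
    mids = [(k + 0.5) * h for k in range(steps)]
    count = sum(in_A1(x) != in_A2(x) for x in itertools.product(mids, repeat=d))
    return count * h ** d


# Toy sets in [0, 2]^2 (assumption: illustrative only). Their symmetric
# difference is the strip 1 < x_1 <= 1.5, whose area is 0.5 * 2 = 1.
vol = sym_diff_volume(lambda x: x[0] > 1.0, lambda x: x[0] > 1.5,
                      T=2.0, steps=4, d=2)
```

The number of cells grows like `steps ** d`, so this direct approximation is only practical in low dimension.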

Let us introduce the following assumption:
A1 There exist positive increasing sequences $(v_n)_{n \in \mathbb{N}^*}$ and $(T_n)_{n \in \mathbb{N}^*}$ such that

$$v_n \int_{[0,T_n]^d} |F - F_n|^p \, \lambda(dx) \xrightarrow[n \to \infty]{P} 0,$$

for some $1 \le p < \infty$.



We now establish our consistency result with convergence rate, in terms of the volume of the symmetric
difference. The following theorem can be interpreted as an extension of Theorem 3.1 in Di Bernardino
et al. (2011) to the case of a $d$-variate distribution function $F$.

Theorem 3.1 Let $c \in (0,1)$. Let $F \in \mathcal{F}$ be a twice differentiable distribution function on $\mathbb{R}^{d*}_+$. Assume
that there exist $r > 0$, $\zeta > 0$ such that $m^\nabla > 0$ and $M_H < \infty$. Assume that for each $n$, with probability
one, $F_n$ is measurable. Let $(v_n)_{n \in \mathbb{N}^*}$ and $(T_n)_{n \in \mathbb{N}^*}$ be positive increasing sequences such that Assumption
A1 is satisfied and that for all $t : |t - c| \le r$, $\partial L(t)^{T_1} \neq \emptyset$. Then, it holds that

$$p_n \, d_\lambda(L(c)^{T_n}, L_n(c)^{T_n}) \xrightarrow[n \to \infty]{P} 0,$$

with $p_n$ an increasing positive sequence such that $p_n = o\left(v_n^{\frac{1}{p+1}} / T_n^{\frac{(d-1)p}{p+1}}\right)$.



The proof is postponed to Section 5. It closely follows the proof of Theorem
3.1 in Di Bernardino et al. (2011).

Theorem 3.1 provides a convergence rate, which is closely related to the choice of the sequence $T_n$.
Note that, as in Theorem 3 in Cuevas et al. (2006), Theorem 3.1 above does not require any continuity
assumption on $F_n$. Furthermore, as in Theorem 3.1 in Di Bernardino et al. (2011), we remark that a
sequence $T_n$ that diverges quickly implies a rather slow convergence rate $p_n$. Moreover, this
phenomenon is amplified by the dimension $d$ of the data, and we face here the well-known curse of
dimensionality. In the following we illustrate this aspect by giving the convergence rate in the case
of the empirical distribution function (see Example 1). First, from Theorem 3.1 we can derive the
following result.

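Corollary 3.1 Under the assumptions and notation of Theorem 3.1, assume that there exists a
positive increasing sequence $(v_n)_{n \in \mathbb{N}^*}$ such that $v_n \|F - F_n\|_\infty \xrightarrow[n \to \infty]{P} 0$. Then, it holds that

$$p_n \, d_\lambda(L(c)^{T_n}, L_n(c)^{T_n}) \xrightarrow[n \to \infty]{P} 0,$$

with $p_n$ an increasing positive sequence such that $p_n = o\left(v_n^{\frac{p}{p+1}} / T_n^{\frac{d+(d-1)p}{p+1}}\right)$.

This result follows directly from Theorem 3.1 and the fact that $v_n \|F - F_n\|_\infty \xrightarrow[n \to \infty]{P} 0$ implies

$$\forall\, p \ge 1, \quad w_n \int_{[0,T_n]^d} |F - F_n|^p \, \lambda(dx) \xrightarrow[n \to \infty]{P} 0, \quad \text{with } w_n = \frac{v_n^p}{T_n^d}.$$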

Let us now present a more practical example.
Example 1 (The empirical distribution function case)

Let $F_n$ be the $d$-variate empirical distribution function. Then, it holds that $v_n \|F - F_n\|_\infty \xrightarrow[n \to \infty]{P} 0$, with
$v_n = o(\sqrt{n})$. From Theorem 3.1, with $p = 2$, we obtain for instance:

$$p_n = o\left(\frac{n^{1/3}}{T_n^{7/3}}\right) \text{ for } d = 3; \qquad p_n = o\left(\frac{n^{1/3}}{T_n^{10/3}}\right) \text{ for } d = 4.$$
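
The exponents in Example 1 can be recomputed for any dimension. The sketch below assumes that the rate of Corollary 3.1 reads $p_n = o\big(v_n^{p/(p+1)} / T_n^{(d+(d-1)p)/(p+1)}\big)$ and that $v_n$ is of order $\sqrt{n}$ in the empirical case; the function names `rate_exponents` and `empirical_rate_exponents` are hypothetical and only serve this illustration.

```python
from fractions import Fraction
from typing import Tuple


def rate_exponents(d: int, p: int) -> Tuple[Fraction, Fraction]:
    """Exponents (alpha, beta) such that p_n = o(v_n**alpha / T_n**beta).

    Assumed form of the rate: alpha = p / (p + 1) and
    beta = (d + (d - 1) * p) / (p + 1).
    """
    alpha = Fraction(p, p + 1)
    beta = Fraction(d + (d - 1) * p, p + 1)
    return alpha, beta


def empirical_rate_exponents(d: int, p: int) -> Tuple[Fraction, Fraction]:
    """Exponents (a, b) such that p_n = o(n**a / T_n**b) when v_n ~ sqrt(n)."""
    alpha, beta = rate_exponents(d, p)
    return alpha / 2, beta


# Exponents of n and T_n for p = 2 in dimensions d = 2, ..., 6.
table = {d: empirical_rate_exponents(d, 2) for d in range(2, 7)}
```

For fixed $p$ the exponent of $n$ does not depend on $d$, whereas the exponent of $T_n$ grows linearly with $d$: this is how the curse of dimensionality shows up in the rate.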



The next section studies the effects of scaling the data.

4. About the effects of scaling data

Suppose now that we scale our data using a scale parameter $a \in \mathbb{R}^*_+$. In our case, the scaled random
vector is $(aX_1, aX_2, \ldots, aX_d) := aX$. From now on we denote by $F_{aX}$ (resp. $F_X$) the distribution
function associated with $aX$ (resp. with $X$). Using the notation of Section 1, let

$$L_a(c) = \{x \in \mathbb{R}^d_+ : F_{aX}(x) > c\}.$$

It is easy to prove (see for instance Section 3 in Tibiletti, 1993) that

$$L_a(c) = a\,L(c),$$

and

$$E_a = B(\{x \in \mathbb{R}^d_+ : |F_{aX}(x) - c| \le r\}, \zeta) = aE.$$

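Define now

$$m^\nabla_a = \inf_{x \in E_a} \|\nabla F_{aX}(x)\|.$$

First, we can obtain the following result, whose proof is postponed to Section 5.
Lemma 4.1 It holds that

$$m^\nabla_a = \frac{1}{a}\, m^\nabla, \quad \forall\, a \in \mathbb{R}^*_+.$$

Furthermore, if

$$M_H = \sup_{x \in E} \|(HF_X)_x\| < +\infty \quad \text{then} \quad M_{H,a} = \sup_{x \in aE} \|(HF_{aX})_x\| < +\infty, \quad \text{with } a \in \mathbb{R}^*_+.$$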
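
The pointwise identity behind Lemma 4.1, namely $\|\nabla F_{aX}(ay)\| = \frac{1}{a}\|\nabla F_X(y)\|$, can be checked numerically. The sketch below is an illustration only: it assumes a hypothetical $F_X$ with independent unit exponential margins and approximates gradients by central finite differences; the names `F_X`, `F_aX` and `grad_norm` and the test point are not from the paper.

```python
import math
from typing import Callable, List, Sequence


def F_X(x: Sequence[float]) -> float:
    """Hypothetical example: independent unit exponential margins on R_+^d."""
    return math.prod(1.0 - math.exp(-t) for t in x)


def F_aX(x: Sequence[float], a: float) -> float:
    """Distribution function of the scaled vector aX: F_aX(x) = F_X(x / a)."""
    return F_X([t / a for t in x])


def grad_norm(F: Callable[[List[float]], float],
              x: Sequence[float], h: float = 1e-6) -> float:
    """Euclidean norm of a central finite-difference gradient of F at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((F(up) - F(down)) / (2.0 * h))
    return math.sqrt(sum(g * g for g in grad))


a = 2.5
y = [0.7, 1.2, 0.4]
# Gradient norm of F_aX at the scaled point a*y ...
lhs = grad_norm(lambda x: F_aX(x, a), [a * t for t in y])
# ... compared with (1/a) times the gradient norm of F_X at y.
rhs = grad_norm(F_X, y) / a
```

Taking the infimum over $y \in E$ (equivalently over $x = ay \in aE$) on both sides gives the first statement of the lemma.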

We can now consider the effects of scaling the data on Theorems 2.1 and 3.1.
Theorem 4.1

1. Under the same notation and assumptions as Theorem 2.1, for $n$ large enough, it holds that

$$d_H(\partial L_a(c)^{aT_n}, \partial L_{n,a}(c)^{aT_n}) \le 6 A a \|F - F_n\|_\infty^{T_n}, \quad \text{a.s.}$$

2. Under the same notation and assumptions as Theorem 3.1, it holds that

$$p_{n,a} \, d_\lambda(L_a(c)^{aT_n}, L_{n,a}(c)^{aT_n}) \xrightarrow[n \to \infty]{P} 0,$$

with $p_{n,a}$ an increasing positive sequence such that $p_{n,a} = o\left(v_n^{\frac{1}{p+1}} / \left(a^{\frac{dp}{p+1}}\, T_n^{\frac{(d-1)p}{p+1}}\right)\right)$.
Remark 2

1. The first result of Theorem 4.1 states that a change of scale of the data implies the same change
of scale for the Hausdorff distance.

2. The second result states that a change of scale of the data implies a rate in

$$o\left(v_n^{\frac{1}{p+1}} / \left(a^d\, T_n^{d-1}\right)^{\frac{p}{p+1}}\right)$$

instead of

$$o\left(v_n^{\frac{1}{p+1}} / \left(T_n^{d-1}\right)^{\frac{p}{p+1}}\right).$$

So, as expected, the scale factor $a$ affects the volume in $\mathbb{R}^d$ with an exponent $d$.

Conclusion



Starting from previous results obtained in Di Bernardino et al. (2011), we propose in this paper a
generalization to the estimation of level sets of a $d$-variate distribution function. The
consistency results are stated in terms of the Hausdorff distance and the volume of the symmetric
difference. We provide a rate of convergence for this second criterion. Moreover, we analyze the
impact of scaling the data on our results. As future work, a complete simulation study and an R package
are in preparation.

5. Proofs 

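Proof of Theorem 3.1

Under the assumptions of Theorem 3.1 we can always take $T_1 > 0$ such that for all $t : |t - c| \le r$,
$\partial L(t)^{T_1} \neq \emptyset$. Then for each $n$, for all $t : |t - c| \le r$, $\partial L(t)^{T_n}$ is a non-empty (and compact) set in
$\mathbb{R}^d_+$.

We consider a positive sequence $\varepsilon_n$ such that $\varepsilon_n \to 0$ as $n \to \infty$. For each $n \ge 1$ the random sets
$L(c)^{T_n} \triangle L_n(c)^{T_n}$, $Q_{\varepsilon_n} = \{x \in [0,T_n]^d : |F - F_n| \le \varepsilon_n\}$ and $\tilde{Q}_{\varepsilon_n} = \{x \in [0,T_n]^d : |F - F_n| > \varepsilon_n\}$ are
measurable and

$$\lambda(L(c)^{T_n} \triangle L_n(c)^{T_n}) = \lambda(L(c)^{T_n} \triangle L_n(c)^{T_n} \cap Q_{\varepsilon_n}) + \lambda(L(c)^{T_n} \triangle L_n(c)^{T_n} \cap \tilde{Q}_{\varepsilon_n}).$$

Since $L(c)^{T_n} \triangle L_n(c)^{T_n} \cap Q_{\varepsilon_n} \subset \{x \in [0,T_n]^d : c - \varepsilon_n \le F \le c + \varepsilon_n\}$ we obtain

$$\lambda(L(c)^{T_n} \triangle L_n(c)^{T_n}) \le \lambda(\{x \in [0,T_n]^d : c - \varepsilon_n \le F \le c + \varepsilon_n\}) + \lambda(\tilde{Q}_{\varepsilon_n}).$$

From Assumption H (Section 1) and Proposition 1.1, if $2\varepsilon_n \le \gamma$ then

$$d_H(\partial L(c + \varepsilon_n)^{T_n}, \partial L(c - \varepsilon_n)^{T_n}) \le 2 \varepsilon_n A.$$

From the assumptions on the first derivatives of $F$ (see Assumption H and Proposition 1.1) and Property 1
in Imlahi et al. (1999), we can write

$$\lambda(\{x \in [0,T_n]^d : c - \varepsilon_n \le F \le c + \varepsilon_n\}) \le (2 \varepsilon_n A)\, d\, T_n^{d-1}.$$

If we now choose

$$\varepsilon_n = o\left(\frac{1}{p_n\, T_n^{d-1}}\right) \qquad (1)$$

we obtain that, for $n$ large enough, $2\varepsilon_n \le \gamma$ and

$$p_n \, \lambda(\{x \in [0,T_n]^d : c - \varepsilon_n \le F \le c + \varepsilon_n\}) \xrightarrow[n \to \infty]{} 0.$$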

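Let us now prove that $p_n \lambda(\tilde{Q}_{\varepsilon_n}) \xrightarrow[n \to \infty]{P} 0$. To this end, we write

$$p_n \lambda(\tilde{Q}_{\varepsilon_n}) = p_n \int \mathbf{1}_{\{x \in [0,T_n]^d \,:\, |F - F_n| > \varepsilon_n\}} \, \lambda(dx) \le \frac{p_n}{\varepsilon_n^p} \int_{[0,T_n]^d} |F - F_n|^p \, \lambda(dx).$$

Take $\varepsilon_n$ such that

$$\varepsilon_n = \left(\frac{p_n}{v_n}\right)^{1/p}. \qquad (2)$$

So, from Assumption A1 in Section 3 we obtain $p_n \lambda(\tilde{Q}_{\varepsilon_n}) \xrightarrow[n \to \infty]{P} 0$. As $p_n = o\left(v_n^{\frac{1}{p+1}} / T_n^{\frac{(d-1)p}{p+1}}\right)$ we can
choose $\varepsilon_n$ that satisfies (1) and (2). Hence the result. □

Proof of Lemma 4.1

First, we remark that

$$F_{aX}(x) = F_X\left(\frac{x}{a}\right).$$

Then, we obtain

$$\begin{aligned}
m^\nabla_a &= \inf_{x \in aE} \left\| \left( \frac{\partial}{\partial x_1} F_X\left(\frac{x}{a}\right), \ldots, \frac{\partial}{\partial x_d} F_X\left(\frac{x}{a}\right) \right) \right\| \\
&= \inf_{x \in aE} \left\| \frac{1}{a} \left( \frac{\partial F_X}{\partial x_1}\left(\frac{x}{a}\right), \ldots, \frac{\partial F_X}{\partial x_d}\left(\frac{x}{a}\right) \right) \right\| \\
&= \frac{1}{a} \inf_{x \in E} \left\| \left( \frac{\partial F_X(x)}{\partial x_1}, \ldots, \frac{\partial F_X(x)}{\partial x_d} \right) \right\| \\
&= \frac{1}{a}\, m^\nabla.
\end{aligned}$$

The second part of Lemma 4.1 follows from elementary calculus. Hence the result. □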

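Proof of Theorem 4.1

Proof of 1.

Following the proof of Theorem 2.1, it holds that

$$d_H(\partial L_a(c)^{aT_n}, \partial L_{n,a}(c)^{aT_n}) \le 6 A_a \sup_{x \in [0,aT_n]^d} \left| F_X\left(\frac{x}{a}\right) - F_n\left(\frac{x}{a}\right) \right|,$$

where $A_a$ denotes the constant of Assumption H associated with $F_{aX}$, which is inversely proportional
to $m^\nabla_a$. Using Lemma 4.1, which gives $A_a = aA$, and the fact that

$$\sup_{x \in [0,aT_n]^d} \left| F_X\left(\frac{x}{a}\right) - F_n\left(\frac{x}{a}\right) \right| = \sup_{x \in [0,T_n]^d} | F_X(x) - F_n(x) |,$$

we get the result. □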
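Proof of 2.

As in the proof of Theorem 3.1 and using the same notation, we can write

$$\lambda(\{x \in [0,aT_n]^d : c - \varepsilon_n \le F_{aX} \le c + \varepsilon_n\}) \le (2 \varepsilon_n A a)\, d\, a^{d-1}\, T_n^{d-1}.$$

If we now choose

$$\varepsilon_n = o\left(\frac{1}{p_{n,a}\, a^d\, T_n^{d-1}}\right) \qquad (3)$$

we obtain that, for $n$ large enough, $2\varepsilon_n \le \gamma$ and

$$p_{n,a} \, \lambda(\{x \in [0,aT_n]^d : c - \varepsilon_n \le F_{aX} \le c + \varepsilon_n\}) \xrightarrow[n \to \infty]{} 0.$$

The second part of the proof is the same as in the proof of Theorem 3.1. We take $\varepsilon_n$ such that

$$\varepsilon_n = \left(\frac{p_{n,a}}{v_n}\right)^{1/p}. \qquad (4)$$

Then, from Assumption A1 in Section 3, we obtain $p_{n,a} \, \lambda(\{x \in [0,aT_n]^d : |F_{aX} - F_{aX,n}| > \varepsilon_n\}) \xrightarrow[n \to \infty]{P} 0$.

As $p_{n,a} = o\left(v_n^{\frac{1}{p+1}} / \left(a^{\frac{dp}{p+1}}\, T_n^{\frac{(d-1)p}{p+1}}\right)\right)$ we can choose $\varepsilon_n$ that satisfies (3) and (4). Hence the result. □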